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Abstract. Consider a Markov chain defined on a finite state space, X ', that leaves invariant the uniform 
distribution on X, and whose transition probabilities are integer multiples of 1/Q, for some integer Q. I 
show how a simulation of n transitions of this chain starting at xq can be viewed as applying a random 
permutation on the space X xU, where U = {0, 1, . . . , Q— 1}, to the start state (xo, Uq), with uq drawn 
uniformly from U. This result can be applied to a non-uniform distribution with probabilities that 
are integer multiples of 1/-P, for some integer P, by representing it as the marginal distribution for 
X from the uniform distribution on a suitably-defined subset of X x y, where y = {0, 1, . . . , P — 1}. 
By letting Q, P, and the cardinality of X go to infinity, this result can be generalized to non-rational 
probabilities and to continuous state spaces, with permutations on a finite space replaced by volume- 
preserving one-to-one maps from a continuous space to itself. These constructions can be efficiently 
implemented for chains commonly used in Markov chain Monte Carlo (MCMC) simulations. I present 
two applications in this context — simulation of K realizations of a chain from K initial states, but with 
transitions defined by a single stream of random numbers, as may be efficient with a vector processor or 
multiple processors, and use of MCMC to improve an importance sampling distribution that already has 
substantial overlap with the distribution of interest. I also discuss the implications of this "permutation 
MCMC" method regarding the role of randomness in MCMC simulation, and the potential use of 
non-random and quasi-random numbers. 



1 Introduction 

Markov chain Monte Carlo (MCMC) simulation might seem to be a fundamentally contractive process. 
A simulation started from a broad initial distribution must, after many transitions, be concentrated 
in the possibly much-smaller region that has high probability under the equilibrium distribution being 
sampled. This implies that the random map determined by the random numbers underlying the Markov 
chain transitions must be contractive — for a finite state space, it must map a large set of states to 
a smaller set of states, and for a continuous state space, it must map a set of large volume to a set 
of smaller volume. This contractive property underlies the ability to "couple" a set of chains so that 
eventually they all "coalesce" to the same state, as is exploited by methods such as coupling from the 
past (Propp and Wilson, 1996) and circular coupling (Neal, 1999/2002). 
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In this paper, I show that with a simple extension of the state space this contractive behaviour can be 
converted to a non-contractive map — which is a permutation when this extended state space is finite, 
or a one-to-one map that preserves volume when the extended state space is continuous. This result 
was suggested by the volume-preserving property of Hamiltonian dynamics (see Neal, 2010), which 
can be used to define an importance sampling procedure based on annealing (Neal, 2005). I expect 
that the result in this paper can be applied to produce similar procedures that combine annealing and 
importance sampling using other MCMC techniques. However, I will leave that for future work, and 
instead present two simpler applications of what I will call "permutation MCMC" . 

One application is to parallel simulation from many initial states using a single stream of random 
numbers. Using a single random number stream for all parallel chains may reduce the computational 
cost, perhaps especially if the parallelism takes the form of vector operations. (At worst, it costs the 
same as using multiple streams, if due to high communication cost it is fastest to compute the same 
stream separately in each processor.) Using a single stream also avoids the issue of how to set up multiple 
streams that are unrelated, a problem that is discussed, for example, by Wu and Huang (2006). 

However, some ways of using a single random number stream to define transitions in parallel chains 
lead to the chains coalescing to the same state, or to states that approach each other increasingly closely, 
eliminating the benefit of multiple chains in producing better estimates. I will demonstrate that this is 
avoided when transitions are defined as random permutations, or as random volume-preserving maps. 

A second application is to improving importance sampling, in which expectations with respect to 
some distribution of interest are found using points drawn from some approximating distribution that 
is easier to sample from. This method produces good results only when the importance sampling 
distribution is a sufficiently good approximation, and most crucially does not give very low probability 
to regions that have significant probability under the distribution of interest. This can be hard to 
guarantee in high-dimensional problems. We might improve an importance sampling distribution that 
is close to being adequate — in the sense that it at least has substantial overlap with the distribution of 
interest — by performing some number of transitions of a Markov chain that leaves the distribution of 
interest invariant starting from a point drawn from the original importance sampling distribution. This 
will improve the approximation even if the random numbers used to simulate this Markov chain are 
fixed, provided we choose the number of transitions randomly, so that the final importance sampling 
distribution is a mixture of distributions after varying numbers of transitions. With standard methods 
of simulation, however, computing the importance sampling probabilities (or densities), as needed to 
find appropriate weights, will often be infeasible, because the same final point might be produced from 
several initial points (or for a continuous distribution, an unknown change in volume may alter the 
densities). I will show how this problem can be bypassed by viewing the transitions as applying a 
random permutation (or volume-preserving map), for which the probability (or density) of the final 
point is the same as that of the initial point. 

Recently, Murray and Elliott (2012) independently devised MCMC simulation methods equivalent or 
similar to some of the methods I present below, though without additional variables needed to produce 
a volume-preserving map. Their aim was to find a simulation method that is insensitive to dependence 
in the stream of random numbers used, or even to whether they are actually random. I conclude this 
paper by also discussing what permutation MCMC says about the role of randomness, and how MCMC 
efficiency might be improved by using permutation MCMC with non-random or quasi-random numbers. 

The programs and scripts used for the experiments in this paper are available from my web page. 
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2 Viewing MCMC for a uniform distribution as a random permutation 



I will begin with the simple case of a Markov chain that samples from the uniform distribution on some 
finite state space. In the following sections, I generalize to other discrete and continuous distributions. 

Consider a Markov chain on some finite state space, X, which we can take to be {0, . . . , M— 1}. Let 
the probability of this chain transitioning to state x' when the current state is x be T(x,x'), and for 
the moment assume these transition probabilities are integer multiples of 1/Q, for some integer Q. We 
wish to use this chain to sample from the uniform distribution on X, so T will be chosen to leave this 
uniform distribution invariant — that is, 

£(l/M)T(x,x') = 1/M (1) 

If we view T as a matrix, this condition is equivalent to all its columns (as well as all its rows) summing 
to one. 

A standard way to simulate a realization, xo, x±, X2, ... of this chain, starting from some state xq, is 
to draw uq, u ii u 2, ■ ■ ■ independently from the uniform distribution on U = {0, 1, . . . , Q — 1} and then 
set 

x'-l 

x i+ i = maxjV: Q s ^T(x i ,x) < (2) 

x=0 

The first step in converting this simulation to a random permutation is to extend the state space 
to X x 1A. We then draw a value for uq uniformly from U. Subsequent transitions from (xq,uo) are 
defined using sq, si, S2, ■ ■ ., which are independently drawn uniformly from U. (We will see below that 
sq,. . . ,s n specify a random permutation mapping (xq,uq) to (x n ,u n ).) From the state (xi,Ui), Xi + \ is 
derived from Xi and U{ as in equation (J2j) above, and Ui + \ is derived from Xi, m, Sj, and x^+l as follows: 

Xi+i—l Xi~l 

u l+ \ = Si + m - Q ^2 T ( x h x ) + Q T (. x i+l' x ) (mod Q) (3) 

x=0 x=0 

where T(x,x') = T(x',x) are the transition probabilities for the reversed chain. 

To see informally the rationale for this, note that the terms on the right other than s, define a value 
for u that would lead back to Xj if an equation analogous to (j2J) were applied with T replaced by T. 
This part of the map from (xj,Uj) to (xj + i, Uj+i) is therefore a permutation. Adding Sj modulo Q is a 
random circular permutation, so the full map from (xi,Ui) to (xj+i,nj+i) is a random permutation as 
well. For any n > 0, the map from (xo, uo) to (x n ,u n ), being a composition of random permutations, is 
also a random permutation. 

Furthermore, if we look at only a single realization of the chain, setting Ui+x to an independent 
random Sj plus anything (mod Q) has the same effect as setting independently at random, so the 
joint distribution of (xi,ni), (^2,^2), ... is the same as for the standard method of simulation. 

Appendix A shows in detail that the map (xj,Uj) — > (xi+\, Uj+i) defined by equations ([2]) and ([3]) is 
a permutation, by explicitly exhibiting the inverse map. 
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Here is a matrix of transition probabilities for a simple example with M = 4 states: 



" 2/3 1/3 
T 1/3 1/3 1/3 

1/3 1/3 1/3 
. 1/3 2/3 

All the columns above sum to one, so these transitions leave the uniform distribution on X = 
{0, 1, 2, 3, 4} invariant. Since all the transition probabilities are multiples of 1/3, we can set Q = 3, and 
hence U = {0, 1, 2}. Note that these transitions are reversible — that is, T(x, x') = T(x' , x) = T(x, x'). 

The permutation maps from to (xj + i,Uj + i) when Sj has each of its possible values are shown 

here: 

XXX 



0123 0123 0123 




Si = Si = 1 Si = 2 



In these diagrams, the array of circles represents all possible (x, u) pairs, and the arrows show how such a 
pair for (xi, u{) is mapped to Uj+i). For example, the arrow out of the state with Xi = 1 and Ui = 2 

goes to a state with Xj+i = 2, regardless of the value of Sj, since Q (T(l, 0)+T(l, 1)) = 3(2/3) = 2 < 2, 
so the maximum in equation ([2]) will be x' = 2. When Sj = 0, equation (|3|) will find a value for m+i 
that would lead back to the state Xi = 1 starting from x%+i = 2, which requires that Ui+i be at least 
Q T(2, 0) = 0, to which must be added the amount by which m was greater than the minimum needed 
for the transition to Xj+i = 2 to be taken (essential to avoid two states mapping to the same new state), 
which in this case is 0. The result is that the diagram for Sj = has the transition (1,2) — > (2, 0). 

The maps for Sj ^ can be obtained from the map for Sj = by circularly shifting the Note 
that when, as here, the transitions are reversible, the diagram for Si = will consist entirely of single 
states with arrows pointing to themselves and pairs of states connected by arrows both ways. 

For comparison, here are the maps produced by the T defined in equation when equation ^ is 
replaced by Ui + \ = Si + Ui (mod Q): 

XXX 



0123 0123 0123 




Si = Si = 1 Si = 2 



Some states have zero or two incoming arrows, so these are clearly not permutations. 



(4) 
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Below, are the transition probabilities, T, for a non-reversible Markov chain with M = 4 states 
that leaves the uniform distribution on X = {0,1,2,3,4} invariant, along with the reverse transition 
probabilities, T, found by transposing T: 



1/2 1/2 

1/4 1/4 1/4 1/4 

1/2 1/2 

1/4 1/4 1/4 1/4 



1/2 1/4 1/4 

1/2 1/4 1/4 

1/4 1/2 1/4 

1/4 1/2 1/4 



(5) 



Since all transition probabilities are multiples of 1/4, we can set Q = 4, so that IA = {0, 1,2,3}. The 
permutation maps for this example from (xj,Uj) to (a^i+i, Ui+i) for each Sj are as follows: 



Si = : u 




Si — 1 : a 




Sj — 2 : u 




Si = 3 : u 




In this example, the value of Xi+i that follows (xi,Ui) is determined using equation ([2]) in the same 
way as for a reversible chain, but the value of V4+1 when Si = is not one that would lead back to 
Xi if T were applied starting from x«+i, but is rather a value that would lead back to Xi if the reverse 
transition, T, were applied. 

In real MCMC applications, unlike these examples, the state space is enormous, and transition 
probabilities are defined algorithmically, rather than via an explicit table. One may then ask whether 
the computation of Xi + \ and Ui + \ from Xj and Ui according to equations (JSJ) and ([3]) is feasible. I 
will defer consideration of this issue to the following sections, in which the method is generalized to 
non-uniform distributions and to continuous state spaces. 
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3 Generalization to non-uniform discrete distributions 



As a first step in generalizing the result in the previous section, let us consider a distribution on 
X = {0, . . . , M — 1} with probabilities proportional to a function tt(x) whose values are all integer 
multiples of 1/P, for some positive integer P. (We may not know the constant of proportionality, 
1/ Y2 X 7r ( 2; )-) This distribution can be obtained as the marginal distribution on X obtained from a 
uniform joint distribution on the following subset of X x y, where y = {0, . . . , P— 1}: 

Z = {(x,y) : < y < Ptt(x)} (6) 

The cardinality of Z is M + = P^7r(x). 

X 

This construction is analogous to what is done for "slice sampling" MCMC methods (Neal, 2003), in 
which Markov transitions are defined on this extended state space. Here, I will assume that our MCMC 
method is defined in terms of transitions on X, with the introduction of the extended space X x y being 
only a device to allow these transitions to be expressed as permutations. (However, transitions defined 
on X x y could be accommodated if desired.) 

Suppose that we have defined a Markov chain on X, with transition probabilities T(x, x'), all integer 
multiples of 1/Q, that leaves the distribution tt(x) invariant. We can define a Markov chain on Z that 
leaves the uniform distribution on Z invariant, with transition probabilities as follows: 



T + ((x,y), (x',y')) = 

These transition probabilities are all integer multiples of 1/Q + , where Q + = (maxP7r(x))! Q. 
To confirm that T + leaves the uniform distribution invariant, note that for any (x',y') in Z, 

Y -^—T + ((x,y),(x',y')) = -L V Ptt(x) T ( x > x ') = 1 1 V" n(x)T(x, x') = (8) 

(x,y)eZ x v ' v > x 

The effect of applying the original transitions, T, starting from some initial state, xq, can be du- 
plicated by drawing yo uniformly from {0, . . . , Ptt(xq)} and then using the transitions T + to simulate 
states (x\,yi), (x2,y2), ■■■ The resulting distribution for x\, x%, ... is the same as if T were applied 
starting with xq — the transition from (xi,yi) to (2^4. i, yi+i) defined by T + ignores j/j, and gives equal 
probabilities of T(xi, Xi + i) / P-rr(xi + i) to P7r(xj + i) values of y , so the total probability for a value xi + \ 
to follow Xi is T(xj,Xj + i). 

An MCMC simulation using T that samples from X with probabilities given by tt can therefore be 
replaced by a simulation using T + that samples from Z with uniform probabilities. This simulation on 
the extended state space Z can be expressed as a random permutation, as described in Section [2 To do 
this, we must decide on an ordering of states in Z. In this paper, I will use a lexicographical order (first 
on x, then on y), in which an (x, y) pair in Z is associated with a label, x + , in X + = {0, 1, . . . , M + — 1}, 
according to the following map: 

x-l 

X+(x,y) = y + P^vr(x) (9) 

x=0 



• B^£i if < y> < Pvr(x') 

Pir(x') (7) 

otherwise 
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The inverse of this map takes x + to (X(x + ), Y(x + )), where 



x-l 



X(x + ) = max|x : -P^^7r(x) < x + |, Y(x + ) 



x 



2=0 



X(x+)-l 

p J2 *m 

x=0 



(10) 



We can now apply the method of Section [21 replacing references there to X, M, T, and Q with 
references to X + , M + , T + , and Q + , after redefining T + to apply to states in X + rather than the 
associated pairs in Z. In terms of T + as redefined in this way, the transitions defined by equation ([2]) 
and equation ([3]) become 



x+'-l 



max jx +/ : Q + T + {xf , x + ) < Ujj 



(11) 



x+=0 



u i+1 = Sl + u % - Q + T + (x+,x + ) + Q+ T + (x+ +1 ,x + ) (mod Q+) (12) 

x+=0 x+=0 

where T + (x,x') = T + (x',x). 

It is convenient to re-express these transitions in terms of (x, y) and the original transitions T. Noting 
that X + (x,y) > X + (x',y) when x > x' and X(x + ) > X(x + ') when x + > x +/ , we can write one of the 
sums appearing above as follows, with x% = X(xf), x%+i = X(xf +l ), and yi+\ = Y(xf +l ): 

Q+f]T+(x+,x + ) = Q+[J2 Y, T+ ( x t> X + (x,y)) + ^ T+(x+, X+(x i+1 ,y))] (13) 

x=0 y=0 

Xi + l-1 



x+=0 



Vi+i 



x=0 



P7r(x i+ i) 



T(xj,Xj+i) 



(14) 



To rewrite the other sum above, we let yi = Y(xf), and define the reverse transition probabilities for 
the original chain as T(x,x') = T(x', x)ir(x')/ir(x). We then have 

X+-1 



.Xi-l 



Q + ^T+(x+ 1 ,x+) = Q+[^T(x m ,x) + 



x+=0 x=0 

Also, note that yi/Pir{xi) and y%+i/ Pir(xi + i) are less than one. 



Plt{Xi 



T(x i+1 ,Xi) 



(15) 



The transitions of equations (]lip and (112f> can now be rewritten using expressions (|14() and (|15p for 
the sums they contain, as follows: 



x'-l 



+1 = max|x': Q + T(xj, x) < Ujj 

Xi+l— 1 

P7r(x i+ i) (in - Q + ^ T(xi,xj^ J (Q + T(xi,x i+1 )) 

x=0 

Xi+i — 1 

+ i = Si + Ui - Q + T(xi,x) - Q + T(xi,x i+ i 



Vi+i 



(16) 
(17) 



z=0 



Q + ^T(x i+ i,x) + (5 + T(x i+ i,: 



x=0 



P7r(x i+ i) 



Vi 

PTT(Xi) 



(mod Q 



(18) 
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Defining y* = y/P, so that y* is in [0,ir(x)), and letting u* 
can rewrite the above equations as follows: 



u/Q + and s* = s/Q + , both in [0, 1), we 



x i+l 



Vi+i 



max 



[x' 



x'-l 

E 

x=0 



u 



i+l 



Pir(x i+ i 



i * 

+ Ua 



T( Xi ,x) < <} 

Xi+l—l 

) (^i ~ ^2 T ( X i' X )) I T(Xi,X i+1 ] 



p 



(19) 



(20) 



x=0 



^2 r ( x *' x ) 

x=0 

Xi-1 



T(xi, x 



i+l) 



Vj+1 

ir(x i+1 



+ y~] T(x i+1 ,x) + T(x i+1 ,Xi) 



x=0 



ir(xi 



(mod 1) 



(21) 



where [/(mod 1) means U — \ U\ , the fractional part of U. 

If we now let P (and hence also M + and Q + ) go to infinity, which corresponds to letting the probabil- 
ities tt(x) take on any real values in [0, 1], we get a simpler expression for y* +1 , which when substituted 
into equation (|21|) gives a simpler expression for u* +l . The final result is the following transition: 



X —1 

Xi + \ = maxjx': ^^T(xj,x) < u*j 



x=0 



Vi+i 



Xi+i — 1 



n(xi+i) (u* - ^2 T(xi,xj\ I T(x.i,x i+ i) 

2 = 



u 



i+l 



+ 



Xi-l 

E 

x=0 



T( x i+li x ) + T{ x i+li x i 



ir(xi) 



(mod 1) 



(22) 



(23) 



(24) 



In this limit, the values Sq,^,^, . . . that determine the random transitions are drawn independently 
from the uniform distribution on [0, 1). Note that this transition preserves the property that y*+i is in 
[0,7r(xj+i)). In a program, it may be more efficient to maintain the quantity y*/n(x) rather than y*. 



Since the map defined by equations (|22p to (|24p is a limit of discrete permutation maps, for an 
increasingly fine uniform grid, it should not only be one-to-one, but also preserve volume. This is shown 
in Appendix B, by explicitly exhibiting the inverse of the map, and showing that its continuous part 
has a Jacobian matrix whose determinant has absolute value one. Note, however, that a computer 
implementation of this map that uses floating-point representations of y* and u* may not be exactly 
reversible, or may not exactly preserve volume, due to round-off error. 

This map is similar to one defined by Murray and Elliott (2012), in their equations (9), (10), and (17). 
However, they have no equivalent of y*, and their update for u* is what would be obtained by replacing 
y\ in equation (f24"|) above by the definition of y* +1 from equation (j23j) . Because of this difference, Murray 
and Elliott's map does not preserve volume. This may not matter for their purpose of producing an 
MCMC method that works with dependent random number streams. 



As an example of the map defined by equations (|22p to (|24p . consider a state space of X = {0, 1, 2} 
with probabilities 7r(0) = 3/10, 7r(l) = 1/10, and ir(2) = 6/10. Let the transition probabilities, T, and 
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their reversal, T, be given by the following matrices: 



1/3 1/3 1/3 
1 
_ 1/3 2/3 _ 



T 



1/3 2/3 
1 
_ 1/6 1/6 2/3 



(25) 



The diagram below shows how (xj, y*, u*) is mapped to (x i+ i, u* + i) when Sj = 0: 
u* 1 u* 







x=0 



x=l 



(x,y*) 



x=2 



A B 

x=0 

C D 



U 

E F 
x=l 

G H 



I J 

x=2 

K L 



x=2 



R 



x=0 



U 



x=2 



T W 



A 


C 


Q 






S 


x= 


=0 






x=2 




B 


D 


R 






T 


E 
F 






x=l 




G 
H 


I K 


M O 


U 






W 


x=0 


x=l 






x=2 




J L 


N P 


V 






X 



1 

The horizontally dimension of each of the two squares above represents the range of [0, 1) for u* . The 
vertical dimension represents the range of (x, y*), with [0, 1) divided into a section for each value of x of 
size tt(x), with the value of y* going from to ir(x) within each such section. Rectangles in the square 
on the left correspond to transition probabilities, T(xi,Xi+i). Points in such a rectangle are mapped to 
points in a rectangle of equal area in the square on the right, as identified by the letters labelling the 
corners. Such rectangles on the right correspond to reverse transition probabilities, T(xj + i,Xj). 

If n(x) is only proportional to the probabilities, not equal, the the vertical scale in the diagram above 
would be stretched or compressed by the factor Yl^fa)- 



Often, the update defined by equations (|22|) to (|24[) will be implementable nearly as efficiently as a 
standard update for the same transition probabilities, in which just equation ([22]) is used with a value 
for u* drawn uniformly from [0, 1). If the search over x' involving the sum in equation (|22p can be done 
efficiently, it will usually deliver the value of this sum for Xi+i at little or no extra cost, as needed for 
equation (|23p . The sum in equation (124p involving the reversed transition probabilities will typically be 
feasible if the analogous sum for the original transition probabilities is feasible, certainly so if the chain 
is reversible, so that the original and reverse transition probabilities are the same. 

In one common class of problems, X = X\ X • • • x Xd, so that a state x is composed of values for 
d discrete variables, with the number of possible values for each of these variables (ie, the cardinality 
of each X^) being small. One popular approach to sampling a distribution on such a state space is to 
simulate a Markov chain that applies d updates in succession, each of which changes only one component 
of x. Examples of such problems are the Ising and Potts models of statistical physics, and Bayesian 
mixture models with conjugate priors in which the continuous parameters have been integrated out, 
leaving only discrete class indicators for each data point (as in Algorithm 3 in (Neal, 2000)). 
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Denoting the transition probabilities when only component k of the state is updated by Tk, we see 
that for problems of this sort Z)%(x,x') will be non-zero for only a small number of values for x' (at 
most the cardinality of X^), even when the cardinality of X is enormous. Furthermore, the values of x' 
for which T^^x, x') may be non-zero are easily identified, so the sums in equations (|22p to (|24p can be 
efficiently computed by explicit summation. 

The Metropolis-Hastings algorithm (Metropolis, et al, 1953; Hastings, 1970) is another popular way 
to define a Markov chain on X that converges to the distribution defined by tt(x). This method draws a 
proposal, Xi, for the state to follow Xi according to some proposal distribution with probabilities given 
by S(xi,Xi), and then accepts this proposal with probability ), where 



If the proposal is not accepted, Xi + \ = X{. The resulting transition probabilities are reversible, and are 
given by 



Standard simulation of such a chain is feasible if one can sample from S(x, x) and compute ir(x) and 
S(x, x) (with the latter not required if S(x, x) = S(x, x) so that their ratio in the acceptance probability 
is always one). Simulation using equations (f22|) to (plj) might be more difficult, however, because 
computation of T(x, x) involves the rejection probabilities (whose computation requires evaluating it) 
for all possible proposals. 

This problem can be bypassed by changing how u* is used to determine x i+ \. In equation (|2"2"j) . 
the full range of u* is partitioned into contiguous sub-intervals associated with each value for Xi + \. 
Instead, we can partition the range of it* into subintervals corresponding to proposals, with sizes given 
by S(xi,X{), and then further subdivide each of these subintervals into a part associated with acceptance 
of the proposal and a part (possibly empty) associated with rejection. A self-transition, with probability 
T(x, x), is then represented by the union of an interval of size S(x, x), corresponding to proposing (and 
accepting) the same state as the current state, and zero or more intervals associated with rejection of 
proposals to move to a different state. 

In detail, a Metropolis-Hastings transition can be performed as follows: 




(26) 





(27) 



x'-l 




(28) 



x=0 




(29) 



x=0 




(30) 




(31) 
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u 



i+1 



Xi-1 



s* + y S(x i+ i,x) + S(x i+ i,Xi)a(x i+ i,Xi) 1 (modi) if aj+i < a(xi, x i+ i) 

^ 7r(xi) (32) 



x=0 

s* + u* (mod 1) 



otherwise 



Appendix C shows that this map is indeed one-to-one and volume preserving. 

As an example, consider Metropolis-Hastings transitions that leave invariant the distribution on 
X = {0, 1, 2, 3} that has probabilities 



tt(0) = 1/3, tt(1) = 1/3, vr(2) = 2/9, vr(3) = 1/9 



(33) 



using the matrix of proposal probabilities, S(x,x), on the left below, which produces the transition 
probabilities, T(x,x'), shown on the right: 



1/2 1/2 

1/3 1/3 1/3 

1/3 1/3 1/3 

1/2 1/2 



1/2 + 1/6 1/3 

1/3 1/3 + 1/9 2/9 

1/3 1/3 + 1/12 1/4 

1/2 1/2 



(34) 



The resulting map from (xi,y*,u*) to (xj+i, y* + i, u* +1 ) defined by equations (|29|) to ([32]) . when = 0, 
is pictured below: 

u* 1 

r- 



A B 



x=0 



x=0 



x=l 



A 
B 



(x,y*) 



x=l 



x=0 



_c_ 

G 



x=2 



x=l 



x=3 



H 
L 

M 



F G 



H • • 



x=l 



x=2 



K L 



J • • 

m|m 



x=2 



x=2 
O P 



x=3 

• N 

x=3 



1 

Since Metropolis-Hastings transitions are reversible, this one diagram shows both the transition and 
its reversal. A corner of a rectangle marked by a black dot is not moved by the map; a corner labelled 
by a letter is mapped to the other corner with the same label. The shaded rectangles are associated 
with rejected proposals. 

Implementing a Metropolis-Hastings transition in this fashion will often be feasible, since the sums in 
equations (|28p . (|29p . and (|32p involve only the proposal probabilities, which will often have a tractable 
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form. However, if such an implementation is still costly, an alternative is to express the proposal 
distribution as a mixture of simpler proposal distributions, as follows: 

S(x,x) = ^2 p(5) S 5 {x,x) (35) 

<5 

Here, p(6) are the probabilities for some set of simpler proposal distributions, Sg. Rather than use S 
itself as a proposal distribution, we perform transition i by randomly selecting a value 5i according to 
p{6) and then performing a Metropolis-Hastings transition with Sg t as the proposal distribution. (Note 
that it is generally valid to choose from a set of transition probabilities randomly at each transition 
of the chain, as long as each transition leaves tt invariant, and the choice is made independently of 
other choices.) We consider the sequence 60,61,62, ■■ ■ to be part of the specification of the random 
volume-preserving map (along with Sq, s*, s^, ■ ■ ■)■ 

For example, a "random walk" Metropolis transition can be defined when X has a group structure, 
with operations © and ©. We can then let 6 be an element of this group, and define 



Sg(x,x) 



1/2 if x = x®5 

1/2 if x = xQS (36) 
otherwise 



This is symmetric, with Sg(x,x) = Ss(x,x), so the acceptance probability of equation ([26]) simplifies 
to min[l, n(x)/ir(x)]. It is easy to implement equations (|2"5|) to (|3"2"j) when S is set to such an S$- The 
probabilities p(6) will control the sizes and directions of jumps that are proposed. 



4 Generalization to continuous distributions 

I will begin generalizing to continuous distributions by considering a Markov chain on a state space that 
is an interval of real numbers, which will be denoted as X* = (a, b). I will assume that the chain leaves 
invariant a distribution that has a density function proportional to tt*(x*), and that the transitions of 
the chain can be expressed as conditional density functions, with T*(x*, x*') the conditional density 
for the state x*' to follow the state x* . I also assume that the total variation distance between the 
transition density functions T*(x* , •) and T*(x* + e, •) goes to zero as e goes to zero — that is, a slight 
perturbation in the current state has only a small probability of altering what state follows — except 
perhaps at a finite set of values for x* . 

A chain with a finite state space X = {0, . . . ,M— 1} can approximate the above chain arbitrarily 
well as M goes to infinity. We associate an x* G X* with each x S X according to a;* = a + (x + 1/2) /i, 
where h = (b — a)/M, and define the transition probabilities for the approximating chain by 

a+h(x'+l) 

T(x,x') = J T*(a + {x + l/2)h, x*)dx* (37) 

a+hx' 

This corresponds more-or-less to how simulations of Markov chains for continuous state spaces are 
actually done on computers, which can represent numbers only to some finite precision. 

We can define a distribution on X with probabilities proportional to tt{x) = 7r*(a + (x + l/2)h). For 
any finite M, the transitions of equation (|37p will in general not leave this distribution exactly invariant. 
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However, we can nevertheless use tt(x) to define transitions on an extended space using equations ([22]) 
to (|24p . In the limit as M increases and hence h approaches zero, we can write the transition probabilities 
as T(x,x') rj hT*(a + (x + l/2)h, a + (x' + l/2)h) = T*(x*,x*'). Equations ([22]) and (JMD can then 
be written as follows: 

x* +1 = max{x*':£r(x*,^)<<} = F^ 1 (x*,u*) (38) 
<+i = 4+ f X ' f*(x* +l ,x*)dx* (modi) = s* + F*(x* +1 ,x*) (modi) (39) 

J a 

Here, F*(x,x') = ff T*(x,x') dx' gives the cumulative distribution functions for the transition 
distributions, and .F" 1 is the inverse of F* with respect to its second argument (which may be defined 
as in equation (|38|) for those u that correspond to more than one x'). The chain's reverse transition 
probabilities are denoted by T*(x* ,x*') = T*(x*' ,x*)tt(x*')/it(x*), and F*(x,x') = f*' f *(x, x') dx' 
gives the cumulative distribution functions for these reversed transitions. 

Note that y* does not appear in these transition equations. It appears in a term on the right of 
equation ([24"]) . but this term disappears as h goes to zero. The update for y* in equation ([23]) becomes 
undefined as h goes to zero, as it depends on low-order parts of u* that disappear in this limit. 

However, with y* eliminated, the transition on the space of x* and u* defined by equations ([38]) and 
(|39p does not preserve volume. To fix this, we need to retain the part of u* that disappears in the limit 
as an additional part of the extended state, denoted as u*, a value in [0, 1). We also introduce values 
£q, tj, t^, ■ ■ ., which like Sqi s i\ s 2> • • • are independently drawn from the uniform distribution on [0,1), 
and which form part of the specification of the random one-to-one volume-preserving map that we can 
view the transitions as being. We can then define the missing part of the transition as follows: 

y* +1 = tt*(x* +1 )v* (40) 

v*+x = *i + Vi /**{**) ( m od 1) ( 41 ) 

Appendix D shows that equations ([55]) to (|4ip define a map on the state space { (x*, u*, y*, v*) : x* G 
(a, b), u* G [0,1), y* £ [0,7r* (x*)), v* G [0,1)} that is one-to-one and volume preserving, with s* and 
t* considered fixed. It follows from this that the transition leaves invariant the distribution in which 
x* has marginal probabilities given by tt*, y* given x* is uniform over [0, ir*(x*)), and u* and v* are 
independent of the other parts of the state and uniformly distributed over [0, 1). Note that it may be 
more efficient in a program to maintain y*/-rr(x*) rather than y*, since equations (|40|) and (|41|) then 
become simply (y* +1 /7r(x| +1 )) = v* and v*,± = t* + (y*/-7r(x*)) (mod 1), eliminating a multiply, a 
divide, and the need to evaluate tt. In some contexts, such as parallel simulation as discussed below, it 
may not be necessary to maintain y* and v* at all, since they are not needed when updating x* and u*. 

This map can easily be adapted to a state space of all real numbers by letting a go to — oo and b go 
to oo. 

The map defined by equations ([55]) and ([59"]) is the same as that defined by Murray and Elliott 
(2012) in their equations (9), (10), and (11). They do not include variables equivalent to y* and v* , 
however, and so do not produce a volume-preserving map. For their application, and others such as 
parallel simulation with a single random number stream, this may not matter, but volume-preservation 
is needed for the importance sampling application discussed below. 
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MCMC is rarely used to sample one-dimensional distributions, since other methods of sampling are 
generally preferable, but the map described above can also be used to express an update of a single 
real component of a multi-dimensional state. Gibbs sampling (Gelfand and Smith, 1990) is one popular 
method of this sort, in which the transition updating component j of a multi-dimensional state, w, sets 
it to a value drawn independently from the conditional distribution of component j given the current 
values of other components. The map defined by equations ([55]) to dUJ simplifies in this case, since the 
transitions are reversible, and ignore the previous value of component j. 

Let tt(w) be proportional to the probability density for the multi-dimensional state w, and define 
\j x* to be state w with component j re] 
in the transition from state Wi as follows: 

7T (X ) = 1T{Wi \j X 



w \j x* to be state w with component j replaced by the value x*. We can then define tt* and F* for use 



TT (X 



{x") dx" i rvwx 



(42) 
(43) 



The update for component j based on the map of equations (f38|) to (jHJ can then be expressed as 

K\<) (44) 
s* + F*(x*) (mod 1) (45) 
7r(4fiK (46) 
t* + y*/7r*(x*) (mod 1) (47) 



x i+l 
* 

u i+l 

* 

Vi+l 



The multi-dimensional state after this update is w^i = 
The diagram below shows such a transition (when 



W i \j X i+1- 



and U = 0): 




yi+i 



v i+l 




















\ 








-4= 











Yi 



The left plot shows how, in this example, a square region of values for x* and u* maps to a smaller 
rectangular region for x* +1 and u* +1 , according to the F* that is plotted. The right plot shows how a 
rectangular region of values for y* and v* maps to a larger square region for y* +1 and v* +1 , according 
to tt(x*) and tt(x* +1 ), represented by the slopes of the lines shown, which in this example match the 



slopes of the corresponding parts of F* (as happens when J* 7r 



i dx* 



1; in general they differ 
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by a constant factor that cancels). Note that the area on the left and the area on the right change by 
factors that cancel, so the transition preserves volume. 

Transitions defined by equations (|44p to (|47p can be efficiently implemented when the conditional 
cumulative distribution function, F*, and its inverse can be computed efficiently In comparison, a 
standard implementation of Gibbs sampling will be efficient whenever it is possible to efficiently sample 
from this conditional distribution. For some distributions, such as the exponential and normal distri- 
butions, computing F^T 1 ^) for a u drawn from the uniform distribution on [0, 1) is close to being the 
most efficient way of generating a random variate, in which case equation (|44p has the same cost as 
a standard Gibbs sampling update. The cost of doing a volume-preserving Gibbs sampling update is 
then the extra time for equations (|45p to (|47p . or only the time for equation (|45p in a context, such 
as discussed below for parallel simulation, where we do not actually need to evaluate equations (|4*6"j) 
and gZp . 

Metropolis updates are another common way of defining transitions for multi-dimensional states. The 
proposal distributions used may change only a single variable, or a subset of variables, or all variables 
simultaneously. If we regard such a proposal distribution as a mixture of proposal distributions Sg, as 
in equation (|35p . with the states proposed by each Sg coming from a small set, then a volume-preserving 
form of the transition can be obtained in the same way as described there for a discrete state space. In 
particular, random-walk proposals as in equation (|36[) can be used, with the operation © being ordinary 
scalar or vector addition, or some other group operation such as addition modulo 1. 

5 Application to parallel or vectorized simulation 

As a first application of permutation MCMC, I will show how it can be used in parallel Markov chain 
simulations from multiple initial states, done with a single stream of random numbers. Using a single 
stream of random numbers may be faster than using separate random number streams, especially if 
the parallelization takes the form of vector operations, and it also avoids the problem of ensuring that 
multiple random streams are independent (see the discussion by Wu and Huang (2006)). However, if a 
single random stream is used for MCMC simulation in the standard fashion, the chains will often coalesce 
to the same state at some time, and thereafter track each other exactly, or they may approach each other 
and then move together, with states that are similar, but not identical. This would eliminate much or all 
of the benefit of simulating multiple chains. Such behaviour is not possible with permutation MCMC, 
as it would require multiple states to map to the same state, or for the mapping to be contractive, and 
hence not preserve volume. 

To illustrate parallel simulation using permutation MCMC for a discrete state, I will use an Ising 
model — which has seen extensive use in statistical physics as a model of magnetization, and which is 
also used in image restoration (Geman and Geman, 1984). 

In the Ising model, the state variables — called "spins" in the physics application — can take the 
values —1 and +1, and are arranged in a two-dimensional array with r row and c columns. The joint 
distribution over the n = rxc spins, x^ , . . . , x^ , is defined in terms of the following "energy" function: 



where M is a set of unordered pairs of spins that are "neighbors" — adjacent either horizontally or 




(48) 



(a,b) £j\f 
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vertically. A spin in the leftmost column is adjacent to the spin in the rightmost column in the same 
row, and similarly for spins in the topmost and bottommost rows, so that all spins have four neighbors. 

The probability of a state, x, in the Ising model is defined using this energy function as follows: 

P(x) = tt(x)/Z/3, withvr(x) = exp(-/3£(x)) (49) 

Here, j3 is some constant, and Zp = Y^ x 7r(x) is the normalizing factor for the un-normalized probabili- 
ties, tt(x). 

Gibbs sampling is one commonly-used MCMC method for Ising systems. Each spin is updated in 
some order, with the new value for the spin being randomly drawn from its conditional distribution 
given the values for the other spins, which is as follows: 

P( x ( a ) = +l | for b ^ a) = [l + exp (- 2(3 ^ xW ) 1 ~ ( 50 ) 

b s.t. 

(a,b)eAf 

Since for each update there are only two possible next states, both standard Gibbs sampling and 
permutation Gibbs sampling using equations (|22p to (|24p are easily implemented. Appendix E contains 
an R program for implementing vectorized Gibbs sampling for the Ising model, both in the standard 
way, with as many random number streams as chains (all subsets of one stream in this implementation), 
or in the standard way, but with the same random stream for all chains, or using a single random stream 
with permutation MCMC. Note that the current value of the spin being updated is ignored, and that 
each update is reversible (though the sequential combination of such updates is not), which lead to 
some simplification of the general permutation MCMC transitions of equations (|22|) to (j2l 



For this demonstration, I used a 4x 5 array of spins and set j3 to 0.4. Figure Q] shows six Gibbs sampling 
simulations (distinguished by colour) done using standard MCMC (with different random numbers for 
each simulation), coupled MCMC (same random numbers with standard method), and permutation 
MCMC (same random numbers using permutation updates). All simulations were started in a state 
where each spin was set independently, with equal probabilities for —1 and +1. For permutation MCMC, 
initial values for u* and for y* /ir(x) were drawn from the uniform distribution on (0, 1). 

The left plots in Figure Q] show the total energy, the right plots the magnetization (the sum of spins), 
after each of 250 iterations, where each iteration consists of an update of each spin in turn. 

The set of standard simulations with multiple random number streams (top) is visually indistinguish- 
able from the set of permutation MCMC simulations with a single random number stream (bottom). 
In contrast, the coupled chains simulated in the standard way with a single stream (middle) quickly 
coalesce to the same state, and thereafter provide no more information than a single chain. 

I did a larger experiment with 100 parallel chains, each running for 1000 iterations. From the runs 
done using each method, I estimated the expectation of the energy, the expectation of the magnetization, 
and the expectation of the absolute value of the magnetization, along with standard errors for these 
estimates, computed assuming that the 100 parallel chains are independent. The results were as follows: 

Energy Magnetization | Magnetization | 



Standard MCMC 
Coupled MCMC 
Permutation MCMC 

Symmetry / Long run 



-27.003 ± 0.071 +0.006 ± 0.339 14.769 ± 0.040 

-27.725 ± 0.002 +4.591 ± 0.027 15.186 + 0.001 

-26.914 ± 0.070 -0.389 ± 0.336 14.720 ± 0.040 

-26.944 + 0.020 14.746 + 0.012 
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Figure 1: Six Gibbs sampling simulations (in different colours) for a 4 x 5 Ising model with j3 = 0.4, 
using standard MCMC, coupled MCMC, and permutation MCMC. The trace plots show energy (left) 
and magnetization (right) after each iteration. 
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The last line above gives more precise values. From symmetry, the true expectation of the magnetiza- 
tion is zero. For the energy and the absolute value of the magnetization, a more precise estimate of 
expectation was obtained by using a standard MCMC run that simulated 200 chains for 5000 iterations. 
Within their standard errors, the results above from standard MCMC and permutation MCMC are 
consistent with these more precise values (and with each other). The standard errors from standard 
MCMC and permutation MCMC are very close. As one would expect, the results from coupled MCMC 
are much less accurate, and the standard errors based on assuming the chains are independent are much 
too small. 

I will illustrate parallel simulation using permutation MCMC on a continuous distribution using the 
bivariate normal with means of zero, standard deviations of one, and correlation 0.95, truncated to the 
interval (—1, 2.5) for the first coordinate and the interval (—1.5, 2) for the second coordinate. I tried 
both Gibbs sampling and single-variable Metropolis updates, using the program in Appendix F. 

Figure [2] shows Gibbs sampling simulations for this truncated normal distribution. The permutation 
MCMC method for this simulation uses equations (|38|) and (|39p only — values for y* and v* are not 
needed in this application, and hence neither are equations (|40p and (|4ip . 

Coupled Gibbs sampling for this distribution leads to all chains approaching each other very closely 
within a few iterations. In contrast, the simulation done with permutation Gibbs sampling is visually 
indistinguishable from standard Gibbs sampling with separate random number streams. 

Figure [3] shows single- variable Metropolis sampling simulations for the truncated normal distribution. 
In the permutation MCMC version, each coordinate update is done using equations f|28[) to (|32p . using 
a random- walk proposal as in equation (|36p . with the same 5 values used for all chains, chosen randomly 
from the normal distribution with mean zero and standard deviation 4. 

The coupled Metropolis simulation again shows a strong dependence between the chains, although 
unlike Gibbs sampling, the states for different chains remain far enough apart to distinguish in the plot. 
The permutation Metropolis simulation with a single random number stream appears visually similar 
to the standard Metropolis simulation with multiple streams. 

In a larger experiment, I ran each method with 100 parallel chains, for 1000 iterations. The expec- 
tations of the two coordinates, and of their squares, were estimated from iterations after the first ten, 
and standard errors for these estimates were found on the assumption that the chains are independent. 
The results were as follows: 





1st 


coord 


2nd 


coord 


1st squared 


2nd squared 


Standard Gibbs sampling 


0.2274 


± 0.0083 


0.2109 


± 0.0083 


0.5775 ± 


0.0073 


0.5909 


± 


0.0067 


Coupled Gibbs sampling 


0.1249 


± 0.0002 


0.1011 


± 0.0002 


0.4955 ± 


0.0003 


0.5102 


± 


0.0002 


Permutation Gibbs sampling 


0.2333 


± 0.0070 


0.2156 


± 0.0072 


0.5874 ± 


0.0072 


0.6013 


± 


0.0066 


Standard Metropolis 


0.2511 


± 0.0144 


0.2355 


± 0.0149 


0.5997 ± 


0.0125 


0.6144 


db 


0.0120 


Coupled Metropolis 


0.4259 


± 0.0074 


0.4016 


± 0.0076 


0.7105 ± 


0.0092 


0.6883 


db 


0.0089 


Permutation Metropolis 


0.2658 


± 0.0162 


0.2468 


± 0.0164 


0.6243 ± 


0.0155 


0.6354 


db 


0.0143 


Long simulation run 


0.2329 


± 0.0012 


0.2162 


± 0.0012 


0.5821 ± 


0.0011 


0.5962 


± 


0.0010 



The more precise estimates in the last line above are from a standard Gibbs sampling run that simulated 
200 chains for 20000 iterations. 
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Figure 2: Six Gibbs sampling simulations for a truncated bivariate normal distribution, using standard 
MCMC, coupled MCMC, and permutation MCMC. 
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Figure 3: Six single-variable Metropolis simulations for a truncated bivariate normal distribution, using 
standard MCMC, coupled MCMC, and permutation MCMC. 
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From this table, we see that, as for the Ising model, coupled MCMC does not work well for a parallel 
simulation with a single stream, but the accuracy of permutation Gibbs sampling and permutation 
Metropolis with a single random stream is close to that of standard simulation with multiple random 
streams. (In the table, there appears to be a slight advantage for permutation Gibbs sampling, and 
a slight disadvantage for permutation Metropolis, but neither is seen consistently in replications with 
different random seeds.) 

Will parallel simulation using permutation MCMC with a single random number stream always 
behave similarly to standard MCMC with multiple streams? For simplicity, consider the case of just two 
chains. With standard MCMC with two random streams, the states of these chains will be independent 
(assuming independent initial states). The map defined by permutation MCMC on its extended state 
space (consisting of x or x*, u or u* , and possible y or y* , etc.) will be a permutation (or a volume- 
preserving map). The map on the joint space for two chains will also be a permutation (or will be 
volume-preserving), and hence will leave the uniform distribution on the pair of (extended) states for 
the two chains invariant. In this uniform joint distribution, the states of the two chains are independent. 

However, this does not imply that parallel simulation with permutation MCMC will produce chains 
with independent states, because the map on the joint space produced by permutation MCMC may 
not be ergodic. Indeed, if the extended state space is finite (as in Section [2D, it cannot be ergodic - 
if the two chains are in the same state, they will remain in the same state, so there are at least two 
disconnected regions of the joint state space that both have non-zero probability. 

This particular non-ergodicity will actually make permutation MCMC perform better than standard 
MCMC, provided the initial states of the chains are chosen to be different, since it will induce negative 
correlations that improve the accuracy of estimates (though the effect is probably negligible in real 
problems with huge state spaces). However, one may wonder whether there might sometimes be other 
non-ergodic aspects of the permutation MCMC map as applied to multiple states, which might lead to 
worse performance compared to standard parallel MCMC. Of course, as long as the MCMC method is 
ergodic for one chain, the results found using permutation MCMC will be valid — only the degree of 
improvement in accuracy from using multiple chains is at issue. 



6 Application to importance sampling 

Another application of permutation MCMC is to improving a deficient, but not too deficient, importance 
sampling distribution. 

Recall that the expectation of some function a(x) with respect to some distribution of interest, with 
probability density (or mass) function proportional to tt(x), can be estimated from points x±, . . . ,xn 
drawn independently from some importance sampling distribution, whose probability density (or mass) 
function is proportional to p(x), as follows: 

N 

J2 a(xi)Tr(xi)/p(xi) 

E n [a(x)} « ^ (51) 

E n(xi)/p(xi) 

i=l 

Methods for assessing the accuracy of such estimates are discussed by Geweke (1989) and Neal (2001, 
Section 3). 
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Suppose that we have found some tractable importance sampling distribution — for which it is 
feasible to sample independent points, and to compute p(x) for these sampled points — but that this 
importance sampling distribution gives very low probability to some region that has non-negligible 
probability under the distribution of interest. Such an importance sampling distribution will not work 
well, since it provides little or no information about this region, even though it cannot be ignored. 

I will demonstrate here that permutation MCMC can be used to improve such a deficient importance 
sampling distribution, as long as there is substantial overlap between the original importance sampling 
distribution and the distribution of interest (that is, their total variation distance is not close to one). 

The improved importance sampling distribution is defined as the mixture of distributions obtained by 
sampling from the original importance sampling distribution and then applying a permutation MCMC 
map for some number of iterations between to M, chosen randomly, here with equal probabilities 
(though unequal probabilities could be used). Since the permutation map is on an extended state 
space, the value for x or x* sampled from the importance sampling distribution must be supplemented 
with a value for u or «*, plus for non-uniform distributions, a value for y or y* , and for continuous 
transitions, a value for v*. These additional parts of the extended state are drawn uniformly from their 
allowed range, which is [0, 1) except for y or y*, whose range is [0,tt(x)) or [0,ir(x*)). These additional 
variables are drawn independently of each other and of x or x* (except for the dependence of the range 
of y or y* on x or x*). 

Before drawing points from the improved importance sampling distribution, we must fix partic- 
ular permutation MCMC maps, for up to M iterations, by sampling so, si, . . . , sm-i, as well as 
to,t±, . . . , tu-i and So, Si, ... , Sm-i if required. We can then sample points as follows: 

1) Sample k uniformly from {0, . . . , M}. 

2) Sample x k or x* k according to p. 

3) Sample other variables in the extended state (eg, u* k ) uniformly from their allowed range. 

4) Simulate M — k permutation MCMC transitions, using s&, . . . , sm-i (and also t^, ■ ■ ■ , tM-\ and 
<5fc, . . . , Sm-i if required), producing extended states indexed by k + 1, . . . , M. (If k = M, no 
transitions are simulated.) 

5) Let the extended state indexed by M be the point drawn from the improved importance sampling 



Supposing that it is feasible to compute p(x), the probability density (or mass) of the point sampled 
above under the improved improved importance sampling distribution can also be computed, with the 
following additional steps: 

6) Do a reverse permutation MCMC simulation for k iterations, starting with the originally sampled 
Xk or x* k , along with the other variables in the extended state indexed by k, using s^-i, ■ ■ ■ , so (and 
also tfe-i) ■ ■ ■ ,to and ■ ■ ■ ,So if required), producing extended states indexed by k— 1, . . . , 0. 
(If k = 0, no reverse transitions are simulated.) 

7) For j = 0, . . . , M, compute p(xj) or p(x*), and then, for a uniform distribution, compute 



distribution. 




(52) 



22 



and for a non-uniform distribution, compute 



i M i 
1 ^ p[Xj 



5 t(4o (53) 



i=o 

or similarly for a continuous state, cc*. 

The extended states indexed by 0, . . . , M are the possible start states that map to the state indexed 
by M, when between M and permutation MCMC iterations are done. Since each map is a permutation, 
or preserves volume, the probability of obtaining the extended state indexed by M with some number 
of iterations is simply the probability of sampling the Xj in the corresponding start state from the 
original importance sampling distribution, times the probability of sampling the additional variables in 
the extended state. 

The total (possibly unnormalized) probability mass or density for obtaining the sampled point from 
the improved importance sampling distribution is p, which simply adds the probabilities of obtaining 
the point from all possible start states. The division by ir(xj) in equation (|53p results from the range 
of y* being [0, Tr(xj)), so that the density of y* is l/"7r(xj). 

If we perform the above procedure N times, we will obtain N points from step (5), which we can 
label Xi for i = 1, . . . , N, along with corresponding quantities p\ from step (7). The desired distribution 
on the extended state is uniform (tt being accounted for in the range of y*), so the improved importance 
sampling estimate for the expectation of a(x) is 

N 

a(xi)/pi 

KHx)} « ^ (54) 

EVA 

i=i 

When M = 0, this reduces to the original importance sampling estimate of equation (|51|) . 

I will demonstrate how this method can sometimes improve an importance sampling distribution using 
a two-dimensional test distribution, in which the first coordinate, a;W , is normally-distributed with mean 
zero and standard deviation one, and given a value for the first coordinate, the second coordinate, x^ , 
is normally-distributed with mean [a^ 1 )] 2 — 1 and standard deviation one. I will consider using the 
method to improve on four importance sampling distributions, all of which are bivariate normal with 
zero correlation and equal standard deviations for the two coordinates, but which differ in the means 
for the two coordinates and in the standard deviation of the coordinates. Figure H] shows samples of 
500 points from the four importance sampling distributions (in red), each shown along with a sample 
of points from the test distribution (in black). 

For MCMC sampling from this test distribution, I use random-walk Metropolis transitions that 
update both coordinates simultaneously, with the proposal offset drawn from the bivariate normal 
distribution with zero correlation, zero mean, and standard deviation of 4 for both coordinates. The 
permutation MCMC implementation of this transition is done by fixing So, Si, ... , 5m—i to values drawn 
from this proposal offset distribution, and using them as described at the end of Section [3l The program 
for this permutation MCMC method and the importance sampling procedure above that uses it is shown 
in Appendix G. 

Figure [5] shows distributions obtained by drawing points from two of the importance sampling dis- 
tributions and then applying between zero and 10 iterations of this permutation MCMC map. The 
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mean 1 , & std dev 0.3, 0.3 



mean 1 , & std dev 1 , 1 



mean 1 , & std dev 3, 3 



mean 0, 4 & std dev 0.3, 0.3 




■J.S ■r'.-i- ■Jwa .-. 





Figure 4: Four initial importance sampling distributions for the test distribution. Each plot shows 
500 black dots from the test distribution and 500 red dots from initial independent normal importance 
sampling distributions with means and standard deviations as shown. 



mean (1,0), sd 0.3, seed 1 



mean (1 ,0), sd 0.3, seed 2 



mean (1,0), sd 0.3, seed 3 



mean (1,0), sd 0.3, seed 4 






mean (0,4), sd 0.3, seed 1 



mean (0,4), sd 0.3, seed 2 



mean (0,4), sd 0.3, seed 3 



mean (0,4), sd 0.3, seed 4 



Mr \ S-* 






Figure 5: Importance sampling distributions obtained by applying from to 10 permutation MCMC 
iterations starting from the leftmost and rightmost importance sampling distributions in Figure [H The 
four plots in each row are for different random selections of sq, . . . , sq and 5q, . . . ,5$. 
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top plots are for an importance sampling distribution (leftmost in Figure HJ that is quite concentrated 
(standard deviation 0.3) and located in an area of the test distribution of fairly high probability. The 
bottom plots are for an importance sampling distribution (rightmost in Figure 0|) with the same stan- 
dard deviation, but located in an area of very low probability under the test distribution. The four plots 
in each row show the importance sampling distributions obtained with four different random selections 
of s , •••,•89 and d ,. . . ,Sg. 

As can be seen in these plots, up to 10 iterations is not enough for the points sampled from these 
overly-concentrated importance sampling distributions to be transformed to have a distribution that 
is close to the test distribution. As the number of iterations is increased, the distribution for x^ 
and will approach the test distribution, but this alone is not enough to ensure that importance 
sampling will work well — the distribution of the entire extended state must be suitable. One way to 
visualize the adequacy of the importance sampling is to imagine drawing values for x^' and x^ from 
the test distribution, along with a value for u* uniformly from [0, 1) and a value for y* uniformly from 
[0, -^(a^ 1 ) , x^ 2 ))), and then applying the reverse permutation MCMC map for M iterations from this 
state. For the improved importance sampler to be adequate, it must almost always be the case that at 
least one of the states from the reverse simulation has values for and x^ that have reasonably high 
probability under the original importance sampling distribution. 

Figure [6] shows results on the test distribution using the four original importance sampling distribu- 
tions of Figure HI focusing on estimating the expectation of [x^ 2 - 1 ] 2 , whose true value is exactly 3. For 
each original distribution, four values of M were tried, with M = being equivalent to just using the 
original importance sampling distribution. For each value of M, results using three random seeds are 



mean 1 , & std dev 0.3, 0.3 



mean 1 , & std dev 1 , 1 



mean 1 , & std dev 3, 3 



mean 0, 4 & std dev 0.3, 0.3 
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20 100 500 




20 100 500 



20 100 500 



Figure 6: Estimates for the expectation of [x^ 2 ^] 2 from improved importance sampling using the four 
original importance sampling distributions shown in Figured! and with M set to (equivalent to the 
original distribution), 20, 100, and 500. Estimates with iV = 2000 are shown by dots, with vertical lines 
extending up and down by twice the standard error. For each value of M, estimates from three random 
settings of sq, . . . , sg and <5q, . . . , <5g are shown. The true expection of 3 is marked with a dotted line. 



25 



shown, which differ both in the points sampled from the original importance sampling distribution and 
in the random choice of so, ... , sm~i and 6q, . . . , 6m-i- The estimates were based on N = 2000 points. 
Standard errors for the estimates were calculated with the procedure at the end of Appendix G. 

From the results with M = 0, we see that of the original importance sampling distributions, only 
the third, with standard deviations of 3, produces adequate results. The more concentrated importance 
sampling distributions give very low probability to some high-probability regions of the test distribution, 
with the result that the estimate obtained is far from the truth, and moreover the estimated standard 
error is much too small. 

The first and second importance sampling distributions are improved by applying permutation 
MCMC. For the first importance sampling distribution, using M = 20 produces estimates that are 
rather inaccurate, but at least the standard errors properly reflect this. Good estimates are obtained 
when M = 100 or M = 500. For the second importance sampling distribution, even M = 20 gives rea- 
sonably good estimates, and realistic standard errors, and the estimates improve (with some variation) 
with larger M. The improvement seen is as expected based on the discussion above, since 20 or more 
Metropolis iterations started at a point drawn randomly from the test distribution are indeed likely 
to produce at least one state from the region with high probability under either of these two original 
importance sampling distributions, since they have substantial overlap with the test distribution. 

In contrast, for the fourth importance sampling distribution, only a slight improvement is seen from 
using permutation MCMC. Even with M = 500, the estimates are very inaccurate, though perhaps the 
standard errors are more realistic than for M = 0. This is also as expected given the discussion above 
— the fourth importance sampling distribution is concentrated almost entirely in a low-probability area 
of the test distribution, so even 500 Metropolis iterations from a point randomly sampled from the 
test distribution are likely to all remain in regions with low probability under the original importance 
sampling distribution. 

Permutation MCMC also produces only a slight improvement for the the third importance sampling 
distribution. This distribution is diffuse enough to cover the entire test distribution. Such a simple 
diffuse importance sampling distribution will usually be ineffective for a high-dimensional problem, 
because too many points drawn from it will land in regions of very low probability under the distribution 
of interest. But for this two-dimensional test problem the effective sample size is reduced by only a factor 
about 5.5, so acceptable results are obtained. We might hope, nevertheless, that applying permutation 
MCMC would improve performance, but even for M = 500, the inefficiency factor is reduced only to 
about 3.0. 

This might seem puzzling, since a point drawn from this diffuse importance sampling distribution is 
likely to map to a point that has high probability density under the test distribution within a fairly small 
number of Metropolis updates. However, points derived from original points with low density under the 
test distribution have very high density under the improved importance sampling distribution, because 
the value of y* for such points is drawn from the narrow range [0, ir(x^' , x^)). Such points therefore 
receive a very low weight in the estimate of equation (|54p . reducing the effective sample size. All sampled 
points will receive nearly equal weight only when M is so large that M reverse transitions started from 
a point drawn randomly drawn from the test distribution are likely to contain several points with low 
density under the test distribution, comparable to the low density of many points drawn from the 
original importance sampling distribution. 
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This method for improving importance samplers using permutation MCMC could itself be improved 
in several respects. First, once random values for so, ... , sm-i and <5o, . . . , Sm-i have been chosen, N 
points from the improved importance sampling distribution could be simulated in parallel. This is most 
easily done if the random choice of k from {0, . . . , M} is replaced by stratified sampling (as is desirable 
in any case), so that exactly N/(M+1) points are generated with each value of k. The simulations for 
each value of k could then be done with the vectorized permutation MCMC procedure. 

Secondly, the procedure described above actually produces results from two improved importance 
sampling distributions, only one of which is used above. The second distribution would use the state 
indexed by after step (7) of the procedure above. More generally, several importance sampling 
distributions that use overlapping sequences sq, . . . and <5o, . . . could be defined, and points drawn from 
all these distributions could be obtained with much of the computation being in common. The results 
from all these importance sampling distributions could then be combined. 

Even with these improvements, the method described above will be restricted to improving impor- 
tance sampling distributions that already have substantial overlap with the desired distribution. For 
complex, high-dimensional problems, finding such a distribution may be difficult. However, I expect 
that permutation MCMC can be used in an annealing framework, as I have previously developed for 
Hamiltonian importance sampling (Neal, 2005), and applied to a wide range of complex distributions, 
including those with multiple isolated modes. As for any importance sampling procedure, it would also 
be possible to estimate the ratio of the normalizing constants for the importance sampling distribu- 
tion and the distribution of interest. These ratios are of great interest in both statistical physics and 
Bayesian inference. 

7 The role of randomness in MCMC simulation 

The initial work on Markov chain Monte Carlo by Metropolis, et al (1953), who used it to sample from 
the "canonical" distribution for a system of molecules, was soon followed by work on an alternative 
deterministic approach to molecular simulation (Alder and Wainwright, 1959). In the context of sta- 
tistical physics, these stochastic and deterministic approaches to simulation become equivalent in the 
limit as the size of the system increases. It is also possible to extend the state space of a deterministic 
dynamical simulation so that even for small systems it visits states according to the same distribution 
that a stochastic simulation would (Nose, 1984). 

Permutation MCMC provides another way of reducing, or even eliminating, the role of randomness 
in MCMC simulation. 

In a standard MCMC simulation, the initial states of the chains may be chosen randomly (though 
usually not from the desired distribution), or by using some non-random heuristic (such as starting at 
the mode), or more-or-less arbitrarily. Subsequent transitions are always random (in practice, pseudo- 
random) rather than deterministic. As long as the transitions are ergodic, the choice of initial states 
affects only the number of iterations needed to reach a good approximation of the equilibrium distri- 
bution, not the asymptotic results from averaging functions of state over subsequent iterations of the 
chains. 

With permutation MCMC, any choice of sq,Si,... (along with any other variables that define the 
permutation or volume-preserving map) will leave the desired distribution invariant — as has also been 
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noted by Murray and Elliott (2012) for their related MCMC method. It follows that if a set of initial 
states are drawn from the desired distribution (which will be uniform, if we extend the state as in this 
paper), estimates based on the sequences of states found by applying permutation MCMC from these 
initial states will be unbiased even if sq, s\, . . . are non-random. 

In most applications, however, we are not able to choose initial states from the desired distribution. 
If we use some other initial state distribution, correct results will be obtained by averaging over the 
sequence of states obtained using permutation MCMC only if the particular sequence sq, si,... that 
we use has a suitable ergodic property. Note that if sq,si,... is considered non-random, it will not 
be the case that the distribution of the state at time t will converge to the desired state as t goes to 
infinity — these state distributions will remain just as non-uniform as the initial state distribution. 
Instead, we can hope that time averages of functions of state will converge to their correct expectations. 
Characterizing when this will occur is an interesting topic for further research, which may relate to 
work on quasi-Monte Carlo methods for MCMC (eg, Chen, Dick, and Owen, 2011). 

Some empirical insight into this question can be obtained using the permutation MCMC procedures 
for the Ising model and the truncated normal distribution that were used in Section [5) 

Here are the results of 100 chains, run for 1000 iterations each, using permutation Gibbs sampling 
for the Ising model: 





Energy 


Magnetization 


Magnetization 


With s set randomly 


-26.914 ±0.070 


-0.389 ± 0.336 


14.720 ± 0.040 


With s repeating 


0.2, 0.6 


-24.201 ± 0.041 


±0.127 ±0.249 


13.633 ± 0.027 


With s repeating 


0.3 


-25.512 ±0.048 


-0.025 ±0.268 


14.178 ±0.027 


With s repeating 


0.213, 0.631 


-26.876 ± 0.061 


±0.055 ± 0.303 


14.705 ± 0.035 


With s repeating 


0.292 


-26.893 ± 0.054 


-0.266 ±0.271 


14.721 ± 0.032 


With s repeating 


0.237 


-26.877 ± 0.050 


-0.069 ± 0.270 


14.733 ± 0.029 


Symmetry / Lonj 


I run 


-26.944 ± 0.020 





14.746 ±0.012 



The first line above is the same as was reported for permutation MCMC in Section [5j The other lines 
are for runs in which sq, s\, . . . are set deterministically, by repeating a pair of values, or a single value. 
We see that incorrect results are obtained when repeating the pair 0.2, 0.6, or the single value 0.3, but 
results are consistent with the answer from symmetry or a long run when repeating the pair 0.213, 
0.631, or the single values 0.292 or 0.237. This difference is perhaps related to 0.2, 0.3, and 0.6 being 
rational with small numerator and denominator, but further research is needed to understand these 
results. 

Figure [7] shows six runs of permutation Gibbs sampling on the truncated normal distribution used in 
Section [SJ The top plot is for permutation MCMC with sq,s\, . . . set randomly, the same as the bottom 
plot of Figure [2j The middle plot is for sq> s l> • • • alternately set to 0.231 and 0.452. In the bottom 
plot, all the Si are set to 0.211. Visually, the two deterministic sequences appear to produce correct 
results. Figure [8] shows more results with all Sj set to a single value — to zero for the top plot, to 0.017 
for the middle plot, and to 0.211 for the bottom plot (as for Figure [7]). Setting all Sj to zero produces 
interesting results, but it clearly does not result in the correct distribution. Setting all Sj to 0.017 does 
appear to produce correct results. 



28 



Plot of x1 vs. x2 Trace plot of x1 
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Figure 7: Six permutation Gibbs sampling simulations for a truncated bivariate normal distribution, 
with s set randomly (as in the bottom plot of Figure [2]), set to alternating values of 0.231 and 0.452, 
and set always to 0.211. 
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Figure 8: Six permutation Gibbs sampling simulations for a truncated bivariate normal distribution, 
with s set always to 0, to 0.017, and to 0.211 (as in the bottom plot of Figure [7]). 
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To confirm these visual impressions, I did 100 runs of 1000 iterations with these deterministic settings 
for so, si, . . .. The estimates obtained, along with their standard errors, are shown below: 





-Lot LUU1U. 




\ (at omioroH 
Loh oLJU.cU.cU. 


ZiilU. DtJLLcxIcU. 


With s set randomly 


0.2333 ± 0.0070 


0.2156 ± 0.0072 


0.5874 ± 0.0072 


0.6013 ± 0.0066 


With s repeating 0.231, 0.452 


0.2408 ± 0.0119 


0.2245 ± 0.0122 


0.5792 ± 0.0094 


0.5944 ± 0.0090 


With s always 0.211 


0.2388 ± 0.0023 


0.2216 ± 0.0023 


0.5893 ± 0.0046 


0.6018 ± 0.0045 


With s always 0.017 


0.2267 ± 0.0055 


0.2087 ± 0.0052 


0.5969 ±0.0112 


0.6110 ±0.0109 


With s always 


0.4431 ± 0.0742 


0.4001 ± 0.0770 


1.0609 ± 0.0974 


1.1092 ±0.0828 


Long simulation run 


0.2329 ± 0.0012 


0.2162 ± 0.0012 


0.5821 ± 0.0011 


0.5962 ±0.0010 



All the runs except the one with s always set to zero produce results that are consistent with the long 
simulation run. However, the runs vary in efficiency, as seen from the standard errors of the estimates. 
Estimates with s repeating 0.231, 0.452 are less efficient than when s is set randomly. When s is fixed at 
0.211, however, the standard errors for the two coordinates are a factor of three smaller (corresponding 
to a factor of nine better efficiency), and the standard errors for the squares of the coordinates are a 
factor of 1.5 smaller (a factor of 2.25 more efficient). When s is fixed at 0.017, the standard errors for 
the coordinates are smaller than for random s, but the standard errors for the squares of the coordinates 
are larger. 

The source of the greater efficiency when s is fixed at 0.017 or 0.211 can be seen from an examination 
of the middle and bottom plots in Figure O In the middle plot, for s = 0.017, the two coordinates 
move upward in a slow but consistent fashion for many iterations, and then quickly move downward, 
before resuming their upward motion. The same behaviour is seen in the bottom plot, for s = 0.211, 
except that the upward motion is quicker, making the pattern less obvious. Such consistent upward 
motion moves around the distribution more efficiently than the random walk motion characteristic of 
standard Gibbs sampling — n steps that are taken consistently in one direction lead to a point about 
n steps away from the start point, whereas n steps taken in a random walk lead to point that is likely 
only about y/n steps away. In this respect the behaviour with s fixed to a small value resembles that 
of "overrelaxation" methods (Adler, 1981; Neal, 1998), though the dynamics of overrelaxation is quite 
different. 

Exploring permutation MCMC methods with non-random values for sq, si, . . . (and <5o, S\, . . . where 
needed) seems to be a promising avenue to finding more efficient MCMC methods, but better theory 
regarding when such non-random methods will be ergodic would be helpful. However, as noted by 
Murray and Elliott (2012), one can ensure ergodicity (assuming the chain is ergodic when simulated in 
the standard manner) by combining non-random selections of Sj with occasional sequences of randomly 
chosen Sj that are long enough to move anywhere in the state space. 
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Appendix A: Proof that the map for a simulation of a uniform 
discrete distribution is a permutation 

To show that the map (xj,Uj) — > (xj+i, defined by equations ([2]) and ([3]) is a permutation, it 
suffices to show that the following map is its inverse: 

x'-l 

x\ = max jx' : Q ^ T(x i+ i,x) < u i+1 - Sj (mod Q)| (55) 

2=0 

x\ — l x i+1 — 1 

u\ = (m+i-Si) - Q ^2,T(x i+ i,x) + Q y~] T(x],x) (mod Q) (56) 

X=0 2 = 

That is, we need to show that equations ([2]), ©, ([55]) . and (f56]) imply that x\ = Xi and u\ = Ui. 
Equation ^ implies that 

Xi+l-l 

< Ui - Q ^2 T ( x i, x ) < QT(xi,x i+ i) = QT(x i+ i,Xi) (57) 

x=0 

and hence 

Xi— 1 x i+l — 1 2'i~ 1 

< Q T(x i+1 ,x) < - Q T(xi,x) + Q^T(x m ,x) < Q^T(x i+ i,x) < Q (58) 

2=0 2=0 2=0 2=0 

It follows from this and equation ([3]) that 

2i + l-l 2;-l 

■"i+i - Si (mod Q) = Ui - Q T(xj,x) + Q^T(x i+ i,x) (59) 

2=0 2=0 



Hence we can rewrite equation ()58[) as 

Xi— 1 2i 

< Q } y T(x i+ i,x) < - s i+ i (mod Q) < Q^jT(x i+ i, x) (60) 

2=0 2=0 

which, from equation (|5"5"j) . implies that x\ = Xj. In equation (f5T)j) . we can now replace xj with x« and 
replace Uj+i with the right side of equation Q, leading to the conclusion that u\ = u, L . 

Appendix B: Proof that the map for a simulation of a discrete 
distribution is one-to-one and volume preserving 

We can show that the map from (xi,y*,u*) to (x, + i, it* +1 ) given by equations ([22]) to (p^|) is 
one-to-one by seeing that the following map is its inverse: 

x'-l 

x\ = max jx' : T(xj + i,x) < u* +1 — s* (mod 1)| (61) 

2=0 

xl-l 

y? = Tr(xt) - s * (mod 1)) - ^f(x 4+1 ,x))/f(x m ,xj) (62) 

2=0 

= £ T(xJ,x) + T(xJ,x m )^i (mod i) ( 63 ) 

2=0 TT^i+U 
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That is, we will show that x\ = Xi, y*} = yi, and u*^ = u*. From equation ([24"]) and the fact that y* is 
in [0,7r(xi)) we see that 



Xi-l 



H+l 



s* (modi) = } y T(x i+1 ,x) + T(x i+1 ,: 



x=0 



n(xi) 



(64) 



since the right side will be in the interval [0, 1). Substituting this in equation (|6ip . we get 

x' — l Xi — l 

:| a;': y~) T(x i+ i,x) < j~) T(x i+ i,x) + T(x i+1 , x^) -^-y } = x % (65) 



max 



x=0 



x=0 



After replacing x\ with x^ in equations (j62|) and (j63j) . and using equations (|64p and ()23|) . we find that 
Vi = Vi and uf = u*. 

To see that volume is preserved by the map defined by equations ([22]) to ([24"]) we can look at the 
Jacobian matrix for the continuous part of this map, from (y*,u*) to (y* +1 , u* +1 ), over a region where 
Xi + \ is constant: 



dy* dy* 
du* du* 





7r(xj + i) 
T(x i ,x i+ i) 



T(x i+ i,Xj) 








7r(x i+ i) 



T(xj,x i+ i) 



T(xi, x 



(66) 



The absolute value of the determinant of this Jacobian matrix is one, so the map preserves volume. 

Appendix C: Proof that the map for a Metropolis-Hastings 

simulation of a discrete distribution is one-to-one 
and volume preserving 

The map from (xi,y*,u*) to (xi + i,y* +1 ,u* +1 ) defined by equations ([28]) to ([32]) is its own inverse when 
s* = 0, and in general its inverse is given by 



x'-l 



max 



lx': ^2S(x i+1 ,x) < u* +1 - s* (mod 1)} 

2 = 

(u* +1 -s* (modi)) - ^2S(x i+1 ,x)j / S(x i+1 ,x\] 



y, 



*t 



x=0 

x\ if a\ < a(x i+ \,x\) 
x; L+ \ otherwise 

77 ( x l) a i I a(x i+1 ,x\) if a\ < a(x i+1 ,x\) 



Vi+l 

( Xi+i-l 



otherwise 



,*t 



^ S{x\,x) + S(x\, x i+ i) a(x\, x i+ i) (modi) if a\ < a(x i+i , x J) 



x=0 

s * ( mod 1) 



(67) 

(68) 
(69) 

(70) 
(71) 



otherwise 
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We can show that x\ = Xj, y*} = y*, and u*^ = u*, separately for rejected and accepted proposals. 
I will do this here assuming = 0; the general case follows from this, since adding s* to u* is itself 
one-to-one and volume-preserving. 

If the proposal Xj+i is rejected — that is, if Oj+i > a(xj,Xj + i) — then x* +1 = x*, y* +1 = y*, and 
u i+i = U ti an d it is easy to see that the inverse map above will also leave x*, y*, and u* unchanged. 

If the proposal Xi + \ is accepted, when Oj+i < a(xi, Xi + i), one can see that the way that u* +1 is set 

that is, the proposal will be the the previous state. This proposal will be 



ensures that x\ = 
accepted, since 

tfsl-i 



,t 



H+l 



x=0 



Xi-1 



y^S(x i+ i,x)) I S(x i+ i,x\) = (u* i+l -^S(x i+ i,x)) I S(x i+ i,Xi) = a(x i+1 ,Xi) 



x=0 



< a(x i+1 ,Xi) = a(x i+ i,x\) 



7r(xi) 
(72) 



Since the proposal is accepted, x\ = Xj. From a\ = a(xi + \, Xi) y* / 7r(xj), one can also see that y*' = y\. 
From equation (p0|) . we see that 

u i = S(xj,x) + a i+1 S(xi,x i+ i) (73) 



,t 



*t 



x=0 



and from equation (i3T]l we see that 



a(x h x i+1 )y* +1 /ir(x i+ i) 



(74) 



from which it follows that u 



*t 



uj. 



To confirm that the map defined by equations f)28f) to ()32f) preserves volume, we look at the Jacobian 
matrix for the continuous part of the map, from (y*,u*) to For a rejected proposal, when 

dj > a(xi,Xi), this is the identity matrix. For an accepted proposal, the Jacobian matrix is as follows 
(noting that x,{ = x«+i): 



dy*+t d < 



+i 



dy* dy* 

dy*+i du t+i 
du* du* 







S(x i+ i,Xi) a(x i+ i,Xi) 



7r(xi) 



S(xi,x i+ i)a(xi,Xi + i] 




(75) 



mm 



S(x i+ i,Xi) S(xi,x i+ i) 



ir(xi) 



1 



S(xi,x i+ i) S(x i+ i,Xi) 



mm 



ir(x i+1 ) ' n(xi) 




(76) 



Since the absolute value of the determinant of this Jacobian matrix is one, the map must preserve 
volume. 
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Appendix D: Proof that the map for a simulation of a continuous 
distribution on an interval (a, b) is one-to-one and 
volume preserving 



The map from (x*,u*,y*,v*) to (x i+1 , ^ 
because it has the following inverse 



x 



*t 



n - : +i> Vi+ii v i+i) defined by equations (|38l) to (HTj) is one-to-one 
F~ x (x* i+X , u* +1 -s* (mod 1)) 



u 



*t 



,*t 



,*t 



Showing the x*^ = x*, uj 1 = u*, yj 1 = y* , and uj 1 = v* is straightforward except for issues involving 
ir*(x*) or T*(x*,:e*T) being zero. Division by zero in equation (I80p will not occur because the state 
space excludes points where tt*(x*) = 0, as no y* value is allowable with such an x*. When T*(x*, x*>) 
may be zero, we define F~ 1 (x* ,u*) to be the largest x*' for which F*(x* ,x* r ) = u* . F~ l will fail to 
invert F* only where the reverse transition density is zero, but this cannot happen for a transition that 
was taken, for which the forward transition density must have been non-zero. 



,*t 



,*t 



F#(x i , x* +1 ) 

TT*(x?)(v* +1 -t* (modi)) 

*t 



(77) 
(78) 
(79) 
(80) 



To see that the map defined by equations (J38J) to (I4ip preserves volume, we look at the determinant 
of its Jacobian matrix, which is 



r 9x* +1 


d< +1 


dy*+i 




dx* 


dx* 


dx* 


dx* 


dx* +1 




dyt+i 




du* 


du* 


du* 


du* 


dx* +1 


du* + i 




dv* +1 


dy* 


dy* 


dy* 


dyt 


dx* +1 


9u* +1 


dy*+i 




dv* 


dv* 


dv* 


dv* 



dx* 
1 



T*(x* 



+ D- 



dx* 



i+l 



dx* 



D 



T*(x*,x* +1 ) 




T*(x*,x* +1 ] 




1 











7T*(x*) 





(81) 



Here, D = (d/dx* +1 )F*(x* +1 ,x*) and x marks elements of the matrix that are irrelevant given the zero 
elements elsewhere. The determinant of this matrix has two non-zero terms, whose sum is 



T*(x* +1 ,x*) + D 



dx* 



1 



1 



T*(X*,X* +1 ) 7T*(x*) 
1 



dx 



i+l 



D 



1 



T*(x* +l ,x*) 



1 



7T [Xa 



dx* T*(X*,X* +1 ) 7T*(x 
s . 7r *«+l)^*«+l^ 



T*(x*,x* +1 ) n*(x*)- " i+v 
Since the determinant is one, the map preserves volume. 



7T*(X*)T*(X*,X* +1 ) 



1 (82) 
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Appendix E: R Program for Ising Model Simulations 



# SETUP FDR ISING MODEL. Stores information in global variables for later use. 
ising_setup <- function (n_rows, n_cols) 

{ n_rows «- n_rows; n_cols «- n_cols; n_spins «- n_rows * n_cols; n_neighbors «- 4 
neighbor <<- matrix (NA, n_spins, n_neighbors) 
for (j in l:n_spins) 

{ neighbor [j , 1] «- if ( j°/,°/,n_rows==0) j-(n_rows-l) else j+1 
neighbor [j , 2] <<- if ( j / / n_rows==l) j + (n_rows-l) else j-1 

neighbor [j , 3] <<- if ( (j-l)7o/7on_rows==n_cols-l) j-n_rows*(n_cols-l) else j+n_rows 
neighbor [j ,4] «- if ( (j-l)7o/7 n_rows==0) j+n_rows* (n_cols-l) else j-n_rows 

} 



# VECTORIZED MCMC FOR ISING MODEL . Method is "s" for standard simulation with multiple 

# random streams, "c" for coupled simulation with a single stream, and "p" for permuation 

# MCMC with a single stream. Initial values for states, u, and yf (y divided by pi) can be 

# specified, or left to be generated randomly. The "random" numbers can be supplied via the 

# s argument (of dimensions n_iters by n_spins) , and are otherwise set here using runif . 

# If the rev argument is TRUE, the reverse map is applied. The value returned is an array 

# of states at all iterations, with attributes giving end values of states, u, and yf . 

vec_ising <- function (beta, method, n_chains, n_iters, states=NULL, u=NULL, yf=NULL, 

s=NULL, rev=FALSE) 

{ if (is.null(states)) 

states <- matrix (sample (c(-l,l), n_chains*n_spins , replace=TRUE) , n_spins, n_chains) 
if (is. null (u) && method=="p") u <- runif (n_chains) 
if (is. null (yf) && method=="p") yf <- runif (n_chains) 
if (is.null(s) && (method=="p" II method=="c") ) 

s <- matrix(runif (n_iters*n_spins) ,n_iters ,n_spins) 



if (rev) s <- s [n_iters : 1 , ,drop=FALSE] 

results <- array (dim=c (n_spins ,n_chains ,n_iters) ) 



for (i in l:n_iters) 

{ for (j in { if (rev) n_spins:l else l:n_spins }) 
{ 

if (method=="p" && rev) u <- (u - s[i,j]) 7,7, 1 # Take s from u first for reverse map 
neighbor_sum <- rep(0,n_chains) 

for (k in 1 :n_neighbors) neighbor_sum <- neighbor_sum + states [neighbor [j ,k] , ] 



if (method=="s") u <- runif (n_chains) 

if (method=="c") u <- s[i,j] 

if (method=="p" ) old_spin01 <- states [j,] 



== 1 



# Old values of spins in 0/1 form 



pO <- 1 / (1 + exp(2*beta*neighbor_sum) ) 

spinOl <- pO <= u 

states [j,] <- 2*spin01 - 1 



# Probabilities of setting spin to 

# New values of spins in 0/1 form 

# New values of spins in -1/+1 form 



if (method=="p") 

{ t <- spin01*(l-p0) + (l-spin01)*p0 



# Probabilities of new spin values 
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old_t <- old_spin01*(l-pO) + (l-old_spin01)*pO # Probabilities of old spin values 
old_yf <- yf 

yf <- (u - p0*spin01) / t # New yf (equal to y/pi) from eq (23) 

u <- p0*old_spin01 + old_t*old_yf # New u (except for + s) from eq (24) 

if (!rev) u <- (u + s[i,j]) 7 7. 1 # Add s to u last for non-reverse map 

} 

} 

results [,,i] <- states 

} 

structure (results, end_states=states , end_u=u, end_yf=yf) 

} 

# COMPUTE THE ENERGY OF A SPIN CONFIGURATION . 

ising_energy <- function (spins) 
{ energy <- 

for (j in l:n_spins) energy <- energy - spins [j] * sum(spins [neighbor [j ,]] ) 
energy / 2 # Divide by two because each term in the energy was added twice above 

} 

# PLOT THE ENERGY ALONG ALL CHAINS. 
energy_plot <- function (res) 

{ n_chains <- dim(res) [2] ; n_iters <- dim(res) [3] 
E <- matrix(NA,n_iters,n_chains) 

for (i in l:n_iters) for (k in l:n_chains) E[i,k] <- ising_energy (res [,k, i] ) 
E_bar <- colMeans(E) 

cat ("Energy: mean " .round (mean (E_bar) ,3) , " +- " ,round(sd(E_bar)/sqrt(n_chains) ,3) , 
"\n",sep="") 

plot (c(l ,n_iters) , c(-2*n_spins , 2*n_spins) , type="n", xlab=" Iteration" , ylab="") 
for (k in l:n_chains) points (E[,k], col=k, pch=20) 

} 

# PLOT THE MAGNETIZATION (SUM OF SPINS) ALONG ALL CHAINS. 

magnet ization_plot <- function (res) 

{ n_chains <- dim(res) [2] ; n_iters <- dim(res) [3] 

M <- matrix(NA,n_iters,n_chains) 

for (i in l:n_iters) for (k in l:n_chains) M[i,k] <- sum(res [ ,k, i] ) 
M_bar <- colMeans(M); M_abs_bar <- colMeans (abs (M) ) 

cat ("Magnetization, mean ", round (mean (M_bar) ,3) , " +- " ,round(sd(M_bar)/sqrt(n_chains) ,3) , 
", mean abs ", round (mean (M_abs_bar) ,3) , " +- " ,round(sd(M_abs_bar)/sqrt(n_chains) ,3) , 
"\n" ,sep="") 

plot (c(l ,n_iters) , c(-n_spins, n_spins) , type="n", xlab=" Iteration" , ylab="") 
for (k in l:n_chains) points (apply (res [ ,k,] ,2, sum) , col=k, pch=20) 

} 
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Appendix F: R Program for Truncated Normal Simulations 



# VECTORIZED GIBSS SAMPLING FOR TRUNCATED NORMAL MODEL. Method is "s" for standard simulation 

# "c" for coupled simulation, and "p" for permuation MCMC. Initial values and the random 

# stream can be specified, or left to be generated randomly. If the rev argument is TRUE, 

# the map is reversed. The value returned is an array of states at all iterations, with 

# attributes giving end values of states and u. Uses global rho, lower, and upper variables. 

vec_trunc_normal_gibbs <- function (method, n_chains, n_iters, states=NULL, u=NULL, 

s=NULL, rev=FALSE) 

{ if (is.null(states)) 

{ states <- matrix (NA, 2, n_chains) 

for (j in 1:2) states [j,] <- runif (n_chains , lower [j], upper [j]) 

} 

if (is. null (u) && method=="p") u <- runif (n_chains) 

if (is.null(s) && (method=="p" II method=="c") ) s <- matrix(runif (2*n_iters) ,n_iters,2) 
if (rev) s <- s [n_iters : 1 , ,drop=FALSE] 

results <- array(dim=c(2,n_chains,n_iters)) 

for (i in l:n_iters) 

{ for (j in { if (rev) 2:1 else 1:2 » 

{ if (method=="p" && rev) u <- (u - s[i,j]) 7.7. 1 
if (method=="s") u <- runif (n_chains) 
if (method=="c") u <- s[i,j] 
if (method=="p") old <- states [j,] 

cond_sd <- sqrt (l-rho~2) ; cond_mean <- rho * states[3-j,] 

ql <- pnorm (lower[j], cond_mean, cond_sd) ; qu <- pnorm (upper[j], cond_mean, cond_sd) 
states [j,] <- # max & min below not usually needed - just in case of numerical issue 

pmax (lower[j], pmin (upper[j], qnorm (ql + (qu-ql)*u, cond_mean, cond_sd))) 
if (method=="p") 

{ u <- (pnorm(old,cond_mean,cond_sd) - ql) / (qu-ql) 
if (!rev) u <- (u + s[i,j]) 7.7. 1 

} 

} 

results [,,i] <- states 

} 

structure (results, end_states=states , end_u=u) 

} 

# VECTORIZED METROPOLIS SAMPLING FOR TRUNCATED NORMAL MODEL. Arguments are as above, but 

# with no "rev" argument, an extra optional "yf" argument, and an extra optional "delta" 

# argument that specifies random-walk proposal offsets for each variable update. 

vec_trunc_normal_met <- function (method, n_chains, n_iters, states=NULL, u=NULL, yf=NULL, 

s=NULL, delta=NULL) 

{ if (is.null(states)) 

{ states <- matrix (NA, 2, n_chains) 

for (j in 1:2) states [j,] <- runif (n_chains , lower [j], upper [j]) 

> 

if (is. null (u) && method=="p") u <- runif (n_chains) 
if (is.null(yf) && method=="p") yf <- runif (n_chains) 

if (is.null(s) && (method=="p" II method=="c") ) s <- matrix(runif (2*n_iters) ,n_iters,2) 
if (is. null (delta)) delta <- matrix(runif (2*n_iters , , 1) ,n_iters ,2) 
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results <- array(dim=c(2,n_chains,n_iters)) 

for (i in l:n_iters) 
{ for (j in 1:2) 

{ if (method=="s") u <- runif (n_chains) 

if (method=="c") u <- s[i,j] 

old <- states [j,] 

cond_sd <- sqrt(l-rho~2) ; cond_mean <- rho * states [3- j,] 

proposal <- (states [j ,] +delta[i , j] ) * (u<l/2) + (states [j ,] -delta [i,j] ) * (u>=l/2) 
a <- (2*u) 7.7. 1 

ar <- dnorm(proposal , cond_mean, cond_sd) / dnorm(old, cond_mean, cond_sd) 
ap <- (proposal >= lower [j]) * (proposal <= upper [j]) * pmin(l,ar) 
accept <- (a < ap) 

states [j,] <- accept*proposal + (1-accept) *old 
if (method=="p") 

{ u[accept] <- ((u<l/2)/2 + pmin(l , 1/ar) *yf/2) [accept] ; yf [accept] <- (a/ap) [accept] 
u <- (u + s[i,j]) 7.7. 1 

} 

} 

results [,,i] <- states 

} 

results 

} 

# PLOT XI VERSUS X2 FOR SIMULATIONS. 

bi_plot <- function (res, from=ll, xlim=c (lower [1] .upper [1] ) , ylim=c (lower [2] .upper [2] ) ) 
{ n_chains <- dim(res) [2] ; n_iters <= dim(res) [3] 

plot (c(), type="n", xlab="xl", ylab="x2", xlim=xlim, ylim=ylim) 

for (k in l:n_chains) points (res [1 ,k,f rom:n_iters] , res [2,k,f rom:n_iters] , pch=20, col=k) 

} 

# PLOT XI OR X2 VERSUS ITERATION NUMBER FOR SIMULATIONS. 

uni_plot <- function (res, j, ylim=c (lower [j] .upper [j] ) ) 
{ n_chains <- dim(res) [2] ; n_iters <- dim(res) [3] 

plot (c () , type="n" ,xlab=" Iteration" ,ylab=paste("x" , j , sep=" ") ,xlim=c(l ,n_iters) ,ylim=ylim) 

for (k in l:n_chains) points (res[j,k,], pch=20, col=k) 

} 

# FIND MEAN ESTIMATES AND STANDARD ERRORS FROM SIMULATIONS. 

estimates <- function (res, from=ll) 

{. n_chains <- dim(res) [2] ; n_iters <- dim(res) [3] 

ave <- ave_sq <- matrix (NA, 2, n_chains) 

for (k in l:n_chains) 

{ ave[,k] <- rowMeans (res [,k, from: n_iters] ) ; 
ave_sq[,k] <- rowMeans (res [,k, from: n_iters] ~2) 

} 

round (cbind (ave=apply (ave , 1 ,mean) , se=apply (ave , 1 , sd) /sqrt (n_chains) , 

ave_sq=apply (ave_sq, 1 ,mean) , se_sq=apply (ave_sq, 1 , sd) /sqrt (n_chains) ) , 4) 

} 
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Appendix G: R Program for Importance Sampling Test Simulations 

# LOG PROBABILITY DENSITY FUNCTION FOR TEST DISTRIBUTION. 

istest_log_density <- function (x) dnorm(x [1] , , 1 , log=TRUE) + dnorm(x [2] ,x [1] "2-1, 1 ,log=TRUE) 

# VECTORIZED PERMUTATION METROPOLIS FOR TEST DISTRIBUTION. Does random-walk Metropolis updates 

# for both variables at once. If delta is not supplied, it's normal with given proposal_sd. 

vec_istest_met <- function (n_chains, n_iters, states=NULL, u=NULL, yf=NULL, 

s=NULL, delta=NULL, rev=FALSE, proposal_sd) 

{ 

if (is.null(states)) states <- matrix (rnorm(n_chains*2,0, 1) , 2, n_chains) 
if (is. null (u)) u <- runif (n_chains) 

if (is. null (yf)) yf <- runif (n_chains) 

if (is.null(s)) s <- runif (n_iters) 

if (is .null (delta) ) delta <- matrix(rnorm(n_iters*2,0,proposal_sd) ,n_iters,2) 
if (rev) { s <- s [n_iters : 1] ; delta <- delta [n_iters : 1 , ,drop=FALSE] } 

results <- array(dim=c(2,n_chains,n_iters)) 

for (i in l:n_iters) 
{ 

if (rev) u <- (u - s[i]) 7.7. 1 

proposal <- t (t(states+delta[i,] ) * (u<l/2) + t(states-delta[i,] ) * (u>=l/2)) 
a <- (2*u) 7,7, 1 

ar <- exp (apply (proposal, 2, istest_log_density) - apply (states, 2, istest_log_density) ) 
ap <- pmin(l,ar) 
accept <- (a < ap) 

states <- t (accept*t (proposal) + (1-accept) *t (states) ) 
u[accept] <- ((u<l/2)/2 + pmin(l , 1/ar) *yf/2) [accept] 
yf [accept] <- (a/ap) [accept] 

if (!rev) u <- (u + s [i] ) 7.7. 1 

results [,,i] <- states 

} 

structure (results, end_states=states , end_u=u, end_yf=yf) 

} 

# METROPOLIS IMPORTANCE SAMPLING FOR TEST DISTRIBUTION. N is number of points, M is maximum 

# number of transitons, mu and sd are for normal start state distribution, and proposal_sd 

# is for the normal random-walk Metropolis proposals. 

istest_met_is <- function (N, M, mu, sd, proposal_sd) 
{ 

pr <- rep(NA,M+l); w <- rep(NA,N); beta <- matrix (NA,N,2) 

states <- matrix (NA, 2, M+l) ; s <- runif (M) ; delta <- matrix (rnorm(M*2,0,proposal_sd) , M, 2) 

for (i in 1:N) 
{ 

k <- if (M==0) else sample (0 :M, 1) # Pick number of transitions from start to end state 
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states [,k+l] <- rnorm(2 ,mu, sd) # Random start state 

u <- runif(l); yf <- runif(l) # Random start values for u and yf 



if (k<M) # Do a forward simulation from the start state on 

states [, (k+2) : (M+l)] <- 

vec_istest_met (proposal_sd, 1, M-k, cbind(states [,k+l] ) , u, yf , 

s=s[(k+l) :M] , delta=delta[(k+l) :M, ,drop=FALSE] , rev=FALSE) [,1,] 

if (k>0) # Do a reverse simulation before the start state 

states [,l:k] <- 

vec_istest_met (proposal_sd, 1, k, cbind(states [,k+l] ) , u, yf , 

s=s[l:k], delta=delta[l :k, ,drop=FALSE] , rev=TRUE) [,1,] 

beta[i,] <- states [, M+l] # Importance sampling point is the end state 

for (k in 0:M) 

pr[k+l] <- sum (dnorm (states [,k+l] ,mu, sd, log=TRUE) ) # Compute log probabilities 

- istest_log_density(states [,k+l] ) # of possible start states 

w[i] <- - (max(pr) + log(sum(exp(pr-max(pr)))) - log(M+l)) # Importance weight: minus the 
} # log of average probability 

list (w=w, beta=beta) 

} 

# ANALYSE IMPORTANCE SAMPLING OUTPUT. Takes as arguments a vector of log weights and a 

# matrix of beta values. Prints various estimates with standard errors. 

is_analysis <- function (w, beta) 
{ 

N <- length (w) 
mx <- max(w) 
w <- exp (w-mx) 

cat ("\nEstimated log normalizing constant:", round (mx+log(mean(w) ) ,3) , 
"+-", round(sd(w)/sqrt(N)/mean(w) ,3) , "\n\n") 

w <- w / mean(w) 

z <- sum(w) 

ave <- colSums (w*beta) / z 
ave_sq <- colSums (w*beta~2) / z 

cat ("Adjusted sample size:", N, "/", l+var(w), "=", N/(l+var (w) ) , "\n\n") 
beta_offset <- t (t(beta) - ave) 

ave_se <- sqrt (colSums ( (w*beta_off set) ~2) / N~2) 

beta_sq_of f set <- t (t(beta~2) - ave_sq) 

ave_sq_se <- sqrt (colSums ( (w*beta_sq_of f set) ~2) / N~2) 

print (cbind (ave=ave, se=ave_se, ave_sq=ave_sq, sq_se=ave_sq_se) ) 

} 
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